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Abstract 

We present a global optimization algorithm for clustering data given the ratio of 
likelihoods that each pair of data points is in the same cluster or in different clusters. To 
define a clustering solution in terms of pairwise relationships, a necessary and sufficient 
condition is that belonging to the same cluster satisfies transitivity. We define a global 
objective function based on pairwise likelihood ratios and a transitivity constraint 
over all triples, assigning an equal prior probability to all clustering solutions. We 
maximize the objective function by implementing max-sum message passing on the 
corresponding factor graph to arrive at an 0{N^) algorithm. Lastly, we demonstrate 
an application inspired by mutational sequencing for decoding random binary words 
transmitted through a noisy channel. 
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1 Introduction 


Most algorithms for clustering data points determine clusters by minimizing in-cluster dif¬ 
ferences. In this paper, we consider the clustering problem wherein the data points are 
governed by two likelihood functions: fo{i,j) describing the probability that two data points 
i and j are from the same cluster, and describing the probability that i and j derive 

from different clusters. We use these two functions to assign a non-zero likelihood to any 
legal clustering conhguration. This likelihood function is a product of /o and /i terms over 
all pairs of data-points. We include with this likelihood a second term that constrains the 
pair-wise assignments of “same” or “different” such that same-ness is transitive: a necessary 
and sufficient condition for ensuring a legal clustering configuration. This constraint term, 
acting on all triples {i,j, k), determines a uniform prior on the space of all distinct clustering 
solutions. 

As in the case of affinity propagation [1] , we first describe the factor graph [2] determined 
by our likelihood function, and use max-sum message passing [3] to identify a clustering 
configuration that maximizes the posterior distribution given our observed data points. The 
result is a clustering algorithm that is 0{N^ x K) in complexity and 0{N^) in memory usage, 
where N is the number of data-points and K is the number of iterations to convergence. In 
our experience, convergence is rapid and K is typically very small. The optimal clustering 
solution is a minimal energy configuration such that points are in the same cluster when they 
experience a net attractive force and in different clusters when the net force is repulsive. This 
algorithm has the added benefit of not requiring an a priori number of clusters. 

In the next section, we calculate the posterior distribution whose maximization deter¬ 
mines the optimal clustering. In section 3, we describe the factor graph for this distribution 
and describe our algorithm based on message passing. In section 4, we consider a detailed ex¬ 
ample that illustrates the method, and in section 5, we conclude with a summary of results, 
some trivial extensions, and future directions in applying relational constraints in factor 
graphs. 
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2 Calculating the posterior distribution 


Notation 

Throughout this paper we will use the following notation. 

/ = {1, 2, •••, iV}, the data points, 

E = {{ij) \i ,3 e I, j, {i,j) = , the edges, 

T = {{i,j,k) I {i,j) e E, {j,k) G E, {k,i) e E}, the triples, 

/o = P{i,j I i and j are in the same cluster) 

/i = P{i,j I i and j are in different clusters) 

We consider the fully connected graph G with nodes / and edges E. We assign a color to 
the edges of G such that any edge is either blue = 0 or red = 1. The hypothesis matrix is a 
function H : —)■ {0,1}, 

0, i,j belong to the same cluster (blue edge) 

ij — 

1, i,j belong to different clusters (red edge) 

For any hypothesis matrix we can compute the likelihood as 


HI,H) = P{I\H)= n (1) 


We assume that every clustering is equally likely, equivalent to a uniform prior over all 
H obeying the transitivity condition. 


PprioriE} 



H represents a valid clustering 
otherwise 


( 2 ) 


Here is the Wth Bell number that counts the total number of partitions of N data 
points. H represents a valid clustering when every triple {i,jj k) & T satisfies the transitivity 
condition. The valid conhgurations for a single triple are shown in Figure [T] We can therefore 
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Figure 1: Valid configurations of hypothesis. Blue edge = 0 and red edge = 1. 



express the uniform prior as a product over all triples: 


Ppriori^H^ 1 r Vijk ; where 

Dm 

(ij,fc)er 

1 (i/,,-, Hui) = (1,1,1), (1,1, 0), (1, 0,1), (0,1,1), (0, 0, 0) 

0 = ( 0 , 0 , 1 ), ( 0 , 1 , 0 ), ( 1 , 0 , 0 ) 


^ijk 


(3) 

(4) 


For further details about the choice of prior and its consequences we refer the reader to 
Appendix 

The posterior distribution over possible hypotheses H can be calculated using the likeli¬ 
hood function and prior dehned above 


Pm)Ppr^or{H) 

^ EHPiI\H)Ppr^oriH)■ 


(5) 


The sum in the denominator is the sum over all possible H. The prior restricts 

the posterior distribution to valid clustering solutions. We dehne the optimal clustering as 
the hypothesis matrix H* that maximizes the posterior probability. 


H* 


argmax P{H\I). 

H 

argmax [logP(/|if) -h log Pprior{H)] 

H 


argmax 

H 




/l(bj) 

/o(h j) 


^ ^ lo§ ) Hjky Hki) 

{i,j,k)£T 


( 6 ) 

(7) 

( 8 ) 


In arriving at the hnal result we have dropped terms that are independent of FTp since they 
do not effect the result of the argmax operation. 

To simplify notation we dehne an objective function 


T{H) = 5 ^ + 5 ^ 

^ijk ( Hij 1 Hj k ) Hik ) 5 

{i,j)&E {i,j,k)&T 


( 9 ) 
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where Sij{Hij) := Hij log and 5ijk ■= ^og Vijk- 


Interpretation in terms of energy minimization 

We can define a Hamiltonian or an energy fnnction, 

^ = - ( 10 ) 

{i,j)£E {i,j,k)£T 

over the space of all matrices H. Note that S = —J^ + constant. The optimal clustering is 
defined as the minimum of this energy function. The terms log /o and log fi can be viewed 
as forces of attraction and repulsion. For a given pair of points i, j, if /o > /i then the 
energy is lowered if Hij = 0 or they are in the same cluster, and if /i > /o the energy is 
lowered when H^j = 1. In the absence of the prior term, the energy is minimized by the 
following solution 

0 , 

1, otherwise 

This solution is applicable when the data point clusters are well separated. Moreover, we 
have constructed this optimal solution through independent decisions for every edge. The 
prior complicates the problem and introduces a three-point long-range interaction term that 
is infinitely repulsive when the transitivity condition is disobeyed. However, if is 

consistent with transitivity, then it minimizes the energy and no further work is needed to 
identify an optimal configuration. 

In the next section, we represent the objective function as a factor graph and use 
message passing to determine the configuration that maximizes the objective function. 

3 Maximizing the objective function 

We can represent the objective function and its dependence on the hypothesis matrix H 
with a factor graph [2]. The factor graph consists of two types of nodes: variable nodes, 
represented by a circle, for every independent hypothesis variable in H, and function nodes, 
represented by a square for each summand in the objective function (|^. When a function 
node g depends on a variable x, we connect the nodes by an edge. Every variable node Hij 
has {N — 2) edges that connect it to function nodes 6ijk for all k ^ i, j; every function node 
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Figure 2: The factor graph for the objective function T defined in equation (|^ is composed 
of two types of junctions. On the left is the sub-graph of the neighbors of the variable node 
Hij and on the right, the sub-graph of neighbors of the function node bijk- 


6ijk is connected to three variable nodes Hij, Hjk and Hki] the function node Sij has only 
one edge to the Hij variable node. The factor graph is depicted in Figure]^ 

We use message passing on the factor graph to solve for H* = argmax^:^ J^. This technique 
has been applied to a variety of problems in different fields as discussed in |lj. Since the 
factor graph has cycles, our approach is an example of loopy belief propagation [3]. The 
success of this method has been explained in terms of the accuracy of the Bethe free energy 
approximation [5] . Every message is a two-tuple as every hypothesis variable has two possible 
values. We denote the message transmitted from Hij to 6ijk by Pij^ijk and the received 
message by aij^ijk as shown in Figure]^ Both messages are functions of the corresponding 
variable node Hij. The function node Sij continuously transmits the same message to Hij. 

The messages are updated as follows, first the variable nodes transmit to function nodes 


pij^ijk{Hij) — Sij{Hij) ^ ^ aij^iji{Hij) , 




( 12 ) 


and then receive responses 


^ij^ijk^Hij') rUc^ ^6ijk(^Hij y Hjky Hki) Pjk^ijki^Hjk') Pki^ijk{,Hki)^ ■ (13) 

This sequence of transmission and reception defines one iteration of the algorithm. At 
the end of each iteration, the configuration H* is given by 


H*j = argmax 

x={0,l} 


Sij{x) + 


ki^ij / 


(14) 
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Figure 3: As in Figure]^ we show the transmitted and received messages for Hij. 


We repeat, iterating through transmissions and receptions until H* is unchanged. 

The message update rules can be considerably simplihed. First, the Pij^ijk messages can 
be eliminated, such that we need only compute updates for aij^ijk- Second, the solution 
H*- only depends on the combination Aijk := so we do not need 

to calculate values for both states (blue and red) but only for the difference. Lastly, we 
introduce the auxiliary matrix Bij := AFjj + J2k^ij fhat reduces the complexity of the 
update procedure from 0{N^) to 0{N^). We refer the interested reader to the discussion 
in Appendix for details. Here, we show the result in the form of an explicit algorithm 
that we call Transitive Propagation, which has complexity 0{N^ x K) and 0{N^) memory 
usage, where K is the (typically small) number of iterations to convergence. 
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Algorithm: Transitive Propagation 

Data; N data-points with distributions /o,i(b j) for all i,j^I and A G (0,1). 
Result; Optimal hypothesis matrix H*. 

Calculate N x N matrix ASij = log/i(i,j) — log/o(i,j). 

Initialize N x N x N matrix Aijk ;= 0; 

Dehne convergence goal M = 1000; 
do 

Compute N X N matrix B : Bij = AS^j + ^ijk ; 

Compute update AAijk dehned by 

AAiji^ max {0, ^jk -^jki T ^ki max *1^0, ^jk Bj^^ Af^ij^ , 

Perform update including the dampening factor A; 


Aijk ^ (1 A) Aij]f T A AAij ]^, 


Calculate ABij = Y^k^ij ^^ijk and 


m = — min 


B. 




AR 


while 0 < m < M; 

Compute N X N matrix H*: H*- 


1, Ri>0 
0 , Bij < 0 


Convergence and dampening 

We have introduced a dampening factor A that helps the algorithm converge to a hxed 
point rather than a cycle. Small values of A promote convergence but also increase the 
running time of the algorithm. We hud that the choice A = 0.5 is a good balance between 
time to convergence and avoiding cycles. 

The entries in Aijk do not converge to hxed values, and this is to be expected because 
we do not normalize the messages after each iteration. The solution {H*j} only depends on 
the sign of the Bij matrix. Consequently, our convergence criterion is as follows; at each 
iteration we estimate the minimum number of iterations, m, it would take to change the sign 
of one entry in Bij and stop when the number of iterations reaches a dehned threshold, M. 





4 Example: clustering random bit patterns 


In this section, we present the clustering problem that inspired the development of transitive 
propagation. Recently, one of the authors proposed a method to uniquely tag DNA molecules 
through a process of random mutagenesis. By marking each template molecule with a random 
pattern, we can resolve two difficulties that continue to plague high-throughput short-read 
sequencing: (1) counting DNA molecules accurately and (2) assembling DNA sequences 
across repeat regions that exceed a read length. We do not discuss the details here, but refer 
the reader to the original paper [6]. 

The example we address in this section is an abstracted version of the hrst problem, 
known in the literature as the A'-populations problem, and has been shown to be NP-hard 
[7]. Assume we have K initial copies of a DNA sequence containing L mutable positions. 
Our mutation protocol randomly assigns one of two letters with equal probability at each 
position, generating K binary words of length L. These templates are copied many times 
and a machine analyzes those copies, outputting a read that matches the initial template’s 
binary word but introduces errors at a rate of Pe 1 per bit. Starting with N reads 
generated through this process, we would like to determine the number of initial templates 
K, assigning the reads to clusters that correspond to the same initial template. 

We work in a regime where N ^ K so that all templates are sampled and read by the 
sequencer. Since the error rate is low, we expect that the N reads form K clusters, where K 
is the unknown number of templates that we wish to determine. We begin by measuring the 
hamming distance dij between all reads i and j. When two reads are in the same cluster. 



(15) 


and when they belong to different clusters. 



(16) 


We generated K templates of length L = 30 bits and generated N = lOif reads by 


uniform sampling. We introduced errors at a rate of pe per bit. The results that we present 
were obtained by averaging over 100 simulations for various values of K and Pe- 

We performed computer simulations to evaluate our algorithm. We generated K random 
templates of length L = 30 bits for K = 10, 20,40. We simulated N = IQK reads with various 
error rates of Pe = 0.01, 0.05, 0.10, 0.15 per bit. Figure [^sliows the accuracy in determination 
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Figure 4: The number of clusters obtained as a function of the error rate for different template 
counts, which are shown as dashed lines. The results are averaged over 100 simulations and 
the error bars denote one standard deviation. 

of the template count as a function of the error rate averaged over 100 simulations. We see 
accurate recovery of the template count even at high error rates of pe = 0.05. 

Our algorithm is also very accurate in determining the correct clustering conhguration 
when the error rate is high. We hxed K = 50 templates of length L = 30 bits and generated 
N = 250 reads for various values of error rate pe = 0.01,0.05,0.10,0.20 and performed 100 
simulations. Our measure of accuracy is the number of edges that are mis-classified by the 
algorithm averaged over all the simulations. We plot the number of incorrect edges as a 
function of the hamming distance between the reads in Figure As a reference, we also 
plot the number of incorrect edges if we classified each edge i,j as red or blue based only on 
the likelihood ratio fi{i,j)/fo{i,j)- As expected, edges with very low or very high hamming 
distance are correctly inferred using both methods. For edges in the intermediate regime our 
method makes better inferences due to the transitive property. 
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Figure 5: The panel above shows the distributions /o (in blue) and fi (in red) for various 
values of the error rate pe- In the corresponding panel below we show the average number 
of incorrect calls made by our algorithm (green, solid line) and by classifying each edge i,j 
based on the likelihood ratio fi{i, j)/fo{i, j) alone. 

5 Discussion 

Transitive propagation is a useful algorithm for clustering data modeled by a balance of 
attractive and repulsive factors. By imposing a naive prior, the method uniformly explores 
the space of all partitions of the data-points, enforcing no a prior number of clusters or 
arbitrary similarity cut-off as required by other methods. As described in Appendix the 
naive prior does impose a non-uniform probability on the number of clusters. However, even 
this prior distribution may be tuned. 

The transitive propagation algorithm can be extended in the following ways. First, the 
existing algorithm implements max-sum message passing to identify a single conhguration 
that maximizes the likelihood. However, we can also implement sum-product message pass¬ 
ing to determine the marginal posterior probabilities that two data-points derive from a 
common cluster. Such an algorithm would allow the selection of only the most conhdent 
edges, so as to discard outlying data-points. Second, the existing framework assumes that 
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Figure 6: Panel A shows an ultrametric tree corresponding to the transitivity propagation 
algorithm. In panel B, we extend the ultrametric property to include four distinct levels, each 
with a likelihood function faihj) for a = 0,1, 2, 3 corresponding to multi-scale clustering. 

the / function depends on i and j such that all clusters follow the same distribution. This 
limitation can be overcome through the inclusion of node-specific clustering parameters that 
enable variation in the intra-cluster distributions. 

The methodology used in this paper to address clustering can be extended to other prob¬ 
lems that we leave to future work. First, the transitive constraint may be considered as the 
hrst non-trivial example of an integer valued ultra-metric which assumes one of two values: 
0 or 1. In this formulation, the prior constraint on Vijk in equation]^ is identical to the ultra¬ 
metric property. We can extend to higher order clusters by allowing a family of likelihood 
functions, fa for a = 0,1,..., L, that measure increasingly divergent relationships between 
nodes (see £gure[^ and allow Hij to assume values of 0, 1, ... L. This modihed algorithm 
enables multi-scale clustering. Second, we can apply the same framework of constrained 
optimization to enforce relationships other than equality. For example, we may have data- 
points that obey a partial ordering. The same constraints apply to H^j as before, however it 
is no longer the case that Hij = Hji. Depending on the nature of the data, the optimization 
function may depend on the four possible states for the pair {Hij, Hji) equivalent to the four 
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possible cases: (1) i and j are the coincident, (2) i precedes j, (3) j precedes i, or (4) there 
is no relation between i and j. 
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A Some observations about our choice of prior 


We can construct a family of conjugate priors for the problem of clustering N data points 
parameterized by a real N x N matrix X = [Xij], 

Pconi{H\X,N) = ^^ n If (1’’) 

where ^ is a normalization factor. With this choice of prior, the posterior distribution in 
equation (j^ becomes 


PiH\I,X,N) = P,,njiH\X',N), 


(18) 


where Xh = X^j + \ogfi{i,j) — log/o(i,j). In this section, we study a one-parameter sub¬ 
family given by Xij = x, where a; is a non-negative real number. 


F{H,x,N) 


1 

Z{x,N) 


n n 

ii,j,k)eT 


(19) 


The uniform prior introduced in ([^ is a member of this one-parameter family, Pprior{H) = 
F{H,x = l,N). The function Z{x,N) is the overall normalization and is usually called the 
partition function 


Z(x,X) = (20) 

H 

where the Hamiltonian 'H{H,x,N) = Y.(i,j,k)&T Hjk, Hik) + Y.^^.^^p,Hij\ogx is an 

example of a spin Hamiltonian where the Hij can be viewed as “spin” degrees of freedom 
and the parameter log a; is the applied magnetic held. However, rather than the usual 
pairwise spin-spin interaction we have a 3-spin 6ijk term. We study the phase diagram of 
this Hamiltonian as a function of x and hnd an order-disorder transition at a critical value 
of a; = Xcritical where xcritical — 1 ~ We suspect that this system has been studied 

in the vast literature on spin Hamiltonians and spin glasses but we are not aware of this. 

Alternatively, it can be written as a sum over conhgurations satisfying the transitivity 
constraint 


Z{x,N) = 

c 


( 21 ) 
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where C runs over all possible partitions of N data points into clusters and h{C) is the 
number of blue edges. When a; = 1 we recover the uniform prior (|^; and Z{x = 1, N) = Bn 
where Bn are the Bell numbers that enumerate the total number of partitions of a set of N 
elements. The limiting behavior is 

to Z(x. N) = l. Jim = 1 . (22) 

The partition function Z{x,N) satishes a recurrence relation 

^ //V\ 

Z{x, N + 1) = J2[ }.) Z{x, N - k), (23) 

Z{x,0) = Z{x,l) = 1, (24) 


which can be derived using the principle of induction. This relation can be used to compute 
values of Z{x,N) numerically. 

Intuitively, the effect of the parameter x is to favor or disfavor clustering conhgurations 
based on the number of blue edges. This can be quantihed by calculating the number of blue 


edges averaged over the space of clustering conhgurations using the prior distribution (19) 


ib) = x-^ log Z(x, N). 
ax 


(25) 


The behavior of the blue edge fraction is shown in Figure We see a phase transition, which 
in the iV —)■ oo limit is a discontinuity at a: = 1. Phase transitions of this sort occur in the 
large N limit and arise when there is a balance between entropic and energetic considerations. 
When X = 1 + e, e > 0, there is an exponentially larger weight associated with conhgurations 
with more blue edges. In the large N limit, logZ(l + e,N) = N{N — l)/21og(l + e), which 
is the contribution to the partition function from the conhguration with all blue edges. This 
is the ordered phase. We estimate the entropy associated with the number of clustering 
conhgurations in the large N limit as logZ(l,iV) ~ A^logA^. The balance gives us an 
estimate of the location of the phase transition as 


Xcritical = 1 + e, e ~ — log iV . 


( 26 ) 


The family of priors in (19) imposes a non-uniform prior on the number of clusters. To 
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Figure 7: The expected value of the fraction of blue edges averaged over the ensemble of 
clustering configurations depicted as a function of the parameter x for various values of N. 


calculate expectation values we add another parameter A to the partition function 


Zx{x,N) = 

c 


(27) 


where n{C) is the number of clusters, and the sum is over all clustering configurations. 
Clearly, Za=i(x, N) = Z{x, N), and taking derivatives with respect to A allows us to calculate 
moments 




< n >= 


d 

dX 


Zx{x,N) 


A=1 


<n^ > - <n >^= 



Zx{x,N) 


A=1 


(28) 

(29) 
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Figure 8: The line shows the mean number of clusters as a function of x. The shading 
indicates one standard deviation on either side. The plot is for N = 300 points. 

The function Z\{x,N) can be calculated using the recurrence relation 

^ / N\ 

Z,{x, N + l) = Xj2i^^j z,{x, N - k), (30) 

Za(x, 0) = 1, 2'a(x, 1) = A. (31) 

The prior distribution is peaked over configurations with a definite number of clusters as 
shown in Figure]^ 

We note that the parameter x can be tuned by the user between 0 and 1 in order to 
influence the outcome of the clustering based on prior knowledge either about the number 
of clusters or the fraction of blue edges. We recommend the uniform prior corresponding to 
the choice x = 1 that weighs all clustering configurations equally. 


B Simplifying the message update equations 

We recall here the message update equations from section 
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( 32 ) 


Pij^ijk{Hij) — Sij{Hij) + ^ ^ aij4-iji{Hij) , 

C^iji—ijki^hij') max Hjl^y -\- Pfci—^jj7c(-^A:i)] • (22) 


^jki^ki 


Simplification 1: eliminate p 

Since the messages pij^ijk play no role in determining the solution H*, they can be eliminated 
giving a single update for the message aij^ijk, given below 

C^ij<—ijk(^Hij') max ( 5ijk ~\~ ^jki^Hjk') ^ ^ CXjki—jkli^Hjk') “1“ Skii^Hki) ^ ^ C^kii—ikli^Hki') j -(24) 


Simplification 2: Only the difference of messages matters 


Equation (14) can be rewritten as 


— Step I S'jj(O) + [^ij<-ijk{^) O(ij-i-ijk{0)]j , 

k^i,j 


Step(a;) = 


1 if a: > 0 
0 otherwise. 


(35) 

(36) 


Note that H* is only dependent on the differences 


^Sij . 5*^j(0) , ^ijk • zjfc(0) • 


(37) 


Moreover, as we shall see shortly, the update for the difference Aijk is completely determined 
by the value of A^jk alone. 


Equation (34) can be written explicitly as 


ijfc(l) max I Sjki^Hjk') 4” / ) Cyjki—jkli^Hjk') Skii^Hki') 4” / ) (yki<—ikl(^Hi 


ki j 


SAO) + E jfcz(O) 4“ >S*fcj(0) 4“ ^ ^ Oiki^^ikl(fi') 




l^i,j,k 


max 


< 0, AS'jfc 4- E Ajkli A^Ski 4“ ^ ^ Akil / 

L l^i,j,k l¥=ij,k J 


(28) 
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In the first step the Sijk either contributes —oo or nothing at all. Since the —cx) contribu¬ 
tion never wins in the niax() function those conhgurations of Hij, Hjk, Hki are effectively 
eliminated. Similarly, 


p'A: (0) 


sim + E j;fcz(0) “|- S'/jj(0) -|- ^ ^ ifcZ(0) 


lj^i,j,k 


l^i,j,k 


+ max 


< 0, ^Sjk -l- E -^jkl + ^Ski E Akii [■ 


(39) 


Taking the difference of equations (39), (38), we arrive at an update for the Aij^. 


Aijk ^ max 
max 

In each iteration of the algorithm we have to update 0{N^) variables Aijk, each of which 
involves a sum over 0{N) terms. This makes the complexity 0{N'^). The run time scal¬ 
ing with N can be improved to 0{N^) by computing the summations ahead of time. We 
introduce the matrix Bij := ASij + Yhkj^i j terms of which the update to the Aijk 

becomes 


0 , ASjk + Yji^i,j,k ^jki + ^Ski + Y.i^i,j,k - 

0, ASjk + ASki + 


Aijk ^ maX'KO, ^jk Ajki Bki Akij'\ max ^0, ^jk Ajki^ ^ki Akij^ , (^0) 

and the best conhguration is obtained by 

= Step(i?,,). (41) 
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